---
title: Multimodal with GPT-4V and LLaVA
authors: beibinli
tags: [LMM, multimodal]
---

![LMM Teaser](img/teaser.png)

**In Brief:**
* Introducing the **Multimodal Conversable Agent** and the **LLaVA Agent** to enhance LMM functionalities.
* Users can input text and images simultaneously using the `<img img_path>` tag to specify image loading.
* Demonstrated through the [GPT-4V notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_gpt-4v.ipynb).
* Demonstrated through the [LLaVA notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb).

## Introduction
Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data.

This blog post and the latest AutoGen update concentrate on visual comprehension. Users can input images, pose questions about them, and receive text-based responses from these LMMs.
We support the `gpt-4-vision-preview` model from OpenAI and `LLaVA` model from Microsoft now.

Here, we emphasize the **Multimodal Conversable Agent** and the **LLaVA Agent** due to their growing popularity.
GPT-4V represents the forefront in image comprehension, while LLaVA is an efficient model, fine-tuned from LLama-2.

## Installation
Incorporate the `lmm` feature during AutoGen installation:

```bash
pip install "pyautogen[lmm]"
```

Subsequently, import the **Multimodal Conversable Agent** or **LLaVA Agent** from AutoGen:

```python
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent  # for GPT-4V
from autogen.agentchat.contrib.llava_agent import LLaVAAgent  # for LLaVA
```

## Usage

A simple syntax has been defined to incorporate both messages and images within a single string.

Example of an in-context learning prompt:

```python
prompt = """You are now an image classifier for facial expressions. Here are
some examples.

<img happy.jpg> depicts a happy expression.
<img http://some_location.com/sad.jpg> represents a sad expression.
<img obama.jpg> portrays a neutral expression.

Now, identify the facial expression of this individual: <img unknown.png>
"""

agent = MultimodalConversableAgent()
user = UserProxyAgent()
user.initiate_chat(agent, message=prompt)
```

The `MultimodalConversableAgent` interprets the input prompt, extracting images from local or internet sources.

## Advanced Usage
Similar to other AutoGen agents, multimodal agents support multi-round dialogues with other agents, code generation, factual queries, and management via a GroupChat interface.

For example, the `FigureCreator` in our [GPT-4V notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_gpt-4v.ipynb) and [LLaVA notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb) integrates two agents: a coder (an AssistantAgent) and critics (a multimodal agent).
The coder drafts Python code for visualizations, while the critics provide insights for enhancement. Collaboratively, these agents aim to refine visual outputs.
With `human_input_mode=ALWAYS`, you can also contribute suggestions for better visualizations.

## Reference
- [GPT-4V System Card](https://openai.com/research/gpt-4v-system-card)
- [LLaVA GitHub](https://github.com/haotian-liu/LLaVA)

## Future Enhancements

For further inquiries or suggestions, please open an issue in the [AutoGen repository](https://github.com/microsoft/autogen/) or contact me directly at beibin.li@microsoft.com.

AutoGen will continue to evolve, incorporating more multimodal functionalities such as DALLE model integration, audio interaction, and video comprehension. Stay tuned for these exciting developments.
